8 research outputs found
LOD-Connected Offensive Language Ontology and Tagset Enrichment
CC BY 4.0
The main focus of the paper is the definitional revision and enrichment of offensive language typology, making reference to publicly available offensive language datasets and testing them on available pretrained lexical embedding systems. We review over 60 available corpora and compare the tagging schemas applied there, while attempting to explain semantic differences between particular concepts of the category OFFENSIVE in English. A finite set of classes that cover aspects of offensive language representation, along with linguistically sound explanations, is presented, based on the categories originally proposed by Zampieri et al. [1, 2] in terms of offensive language categorization schemata and tested by means of Sketch Engine tools on a large web-based corpus. The schemata are juxtaposed and discussed with reference to the non-contextual word embeddings FastText, Word2Vec, and GloVe. The paper also provides the methodology for mapping from existing corpora to a unified ontology. The proposed schema will enable further comparable research and effective use of corpora of languages other than English. It will also be applied in building an enriched tagset to be trained and used on new data, with the application of recently developed LLOD techniques [3].
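To illustrate the kind of mapping from existing corpus tagsets to a unified schema that the abstract describes, here is a minimal sketch. It assumes the three-level scheme of Zampieri et al. (offensive or not, targeted or untargeted, target type) as the unified side. The source labels ("neutral", "profanity", "insult", "hate") and the names `UnifiedTag`, `TAG_MAP`, and `map_tag` are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple, Optional


class UnifiedTag(NamedTuple):
    """One label in a three-level unified scheme (assumed, after Zampieri et al.)."""
    offensive: str            # "OFF" or "NOT"
    targeted: Optional[str]   # "TIN" (targeted), "UNT" (untargeted), or None
    target: Optional[str]     # "IND", "GRP", "OTH", or None


# Hypothetical source-corpus labels mapped onto the unified scheme.
TAG_MAP = {
    "neutral": UnifiedTag("NOT", None, None),
    "profanity": UnifiedTag("OFF", "UNT", None),
    "insult": UnifiedTag("OFF", "TIN", "IND"),
    "hate": UnifiedTag("OFF", "TIN", "GRP"),
}


def map_tag(source_tag: str) -> Optional[UnifiedTag]:
    """Return the unified tag for a source label, or None if it is unmapped."""
    return TAG_MAP.get(source_tag.strip().lower())
```

Returning `None` for unmapped labels keeps gaps in the mapping visible rather than silently forcing a source label into the nearest class.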
A survey of guidelines and best practices for the generation, interlinking, publication, and validation of linguistic linked data
This article discusses a survey carried out within the NexusLinguarum COST Action which aimed to give an overview of existing guidelines (GLs) and best practices (BPs) in linguistic linked data (LLD). In particular, it focused on four core tasks in the production and publication of linked data: generation, interlinking, publication, and validation. We discuss the importance of GLs and BPs for LLD before describing the survey and its results in full. Finally, we offer a number of directions for future work in order to address the findings of the survey.
An OWL ontology for ISO-based discourse marker annotation
Purpose: Discourse markers are linguistic cues that indicate how an utterance relates to the discourse context and what role it plays in conversation. The authors are preparing an annotated corpus in nine languages, and specifically aim to explore the role of Linguistic Linked Open Data (LLOD) technologies in the process, i.e., the application of web standards such as RDF and the Web Ontology Language (OWL) for publishing and integrating data. We demonstrate the advantages of this approach.
Validation of language agnostic models for discourse marker detection
Using language models to detect or predict the presence of language phenomena in text has become a mainstream research topic. With the rise of generative models, experiments using deep learning and transformer models trigger intense interest. Aspects such as precision of predictions, portability to other languages or phenomena, and scale have been central to the research community. Discourse markers, as language phenomena, perform important functions, such as signposting, signalling, and rephrasing, by facilitating discourse organization. Our paper is about discourse marker detection, a complex task as it pertains to a language phenomenon manifested by expressions that can occur as content words in some contexts and as discourse markers in others. We have adopted a language-agnostic model trained on English to predict the presence of discourse markers in texts in eight other languages unseen by the model, with the goal of evaluating how well the model performs in languages with different structural and lexical properties. We report on the process of evaluating and validating the model's performance across European Portuguese, Hebrew, German, Polish, Romanian, Bulgarian, Macedonian, and Lithuanian, and on the results of this validation. This research is a key step towards multilingual language processing.
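As an illustration of the kind of per-language validation the abstract describes, the sketch below computes precision, recall, and F1 for binary discourse marker detection, grouped by language. It is only an assumption about how such an evaluation could be set up. The abstract does not state which metrics or data format the authors used, and the function name `per_language_scores` and the record layout are hypothetical.

```python
from collections import defaultdict


def per_language_scores(records):
    """Compute precision, recall, and F1 per language.

    records: iterable of (language, gold, predicted) tuples, where gold and
    predicted are booleans meaning "this token is a discourse marker".
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for lang, gold, pred in records:
        c = counts[lang]
        if pred and gold:
            c["tp"] += 1
        elif pred and not gold:
            c["fp"] += 1
        elif gold and not pred:
            c["fn"] += 1

    scores = {}
    for lang, c in counts.items():
        predicted_pos = c["tp"] + c["fp"]
        gold_pos = c["tp"] + c["fn"]
        precision = c["tp"] / predicted_pos if predicted_pos else 0.0
        recall = c["tp"] / gold_pos if gold_pos else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[lang] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

Reporting scores per language rather than pooled makes it easier to see whether performance drops for languages that are structurally distant from the training language.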
TED-ELH Parallel Corpus (ELEXIS)
The corpus contains aligned parallel scripts of TED Talks in English, Lithuanian, and Hebrew. It contains spoken language data.
See also: http://hdl.handle.net/20.500.11821/3